(57) Abstract

A method for determining the amino acid sequence of an unknown peptide comprising (a) determining a molecular mass and an 
experimental fragmentation spectrum for the unknown peptide; (b) comparing the experimental fragmentation spectrum of the unknown 
peptide to theoretical fragmentation spectra calculated for a peptide library composed of all possible linear sequences of amino acids having 
a total mass that corresponds to the molecular mass of the unknown peptide; and (c) identifying a peptide in the peptide library having a 
theoretical fragmentation spectrum that matches most closely the fragmentation spectrum of the unknown peptide. 

A METHOD FOR DE NOVO PEPTIDE SEQUENCE DETERMINATION

FIELD OF THE INVENTION 

The invention relates to a method for the determination of the precise linear
sequence of amino acids in a peptide, polypeptide or protein, without recourse or
reference to either a known pre-defined database or to sequential amino acid residue
analysis. As such, the method of the invention is a true de novo peptide sequence
determination method.

BACKGROUND OF THE INVENTION

The composition of a peptide (which term also includes polypeptide or protein)
as a sequence of amino acids is well understood. Each peptide is uniquely defined by a
precise linear sequence of amino acids. Knowledge of the precise linear arrangement or
sequence of amino acids in a peptide is required for various purposes, including DNA
cloning, in which the sequence of amino acids provides information required for
oligonucleotide probes and polymerase chain reaction ("PCR") primers. Knowledge of
the exact sequence also allows the synthesis of peptides for antibody production,
provides identification of peptides, aids in the characterization of recombinant products,
and is useful in the study of post-translational modifications.

A variety of sequencing methods are available to obtain amino acid sequence
information. For example, a series of chemical reactions, e.g., Edman reactions, or
enzymatic reactions, e.g., exo-peptidase reactions, are used to prepare sequential
fragments of the unknown peptide. Either an analysis of the sequential fragments or a
sequential analysis of the removed amino acids is used to determine the linear amino acid
sequence of the unknown peptide. Typically, the Edman degradation chemistry is used
in modern automated protein sequencers.

In the Edman degradation, a peptide is sequenced by degradation from the
N-terminus using the Edman reagent, phenylisothiocyanate (PITC). The degradation
process involves three steps, i.e., coupling, cleavage, and conversion. In the coupling
step, PITC modifies the N-terminal residue of the peptide, polypeptide, or protein. An
acid cleavage then cleaves the N-terminal amino acid in the form of an unstable
anilinothiazolinone (ATZ) derivative, and leaves the peptide with a reactive N-terminus
and shortened by one amino acid. The ATZ derivative is converted to a stable
phenylthiohydantoin in the conversion step for identification, typically with reverse phase
high performance liquid chromatography (RP-HPLC). The shortened peptide is left with
a free N-terminus that can undergo another cycle of the degradation reaction. Repetition
of the cycle results in the sequential identification of each amino acid in the peptide.
Because of the sequential nature of amino acid release, only one molecular substance can
be sequenced at a time. Therefore, peptide samples must be extremely pure for accurate
and efficient sequencing. Typically, samples must be purified with HPLC or SDS-PAGE
techniques.

Although many peptide sequences have been determined by Edman degradation,
currently, most peptide sequences are deduced from DNA sequences determined from
the corresponding gene or cDNA. However, the determination of a protein sequence
using a DNA sequencing technique requires knowledge of the specific nucleotide
sequence used to synthesize the protein. DNA sequencing cannot be used where the
nature of the protein or the specific DNA sequence used to synthesize the protein is
unknown.

A peptide sequence may also be determined from experimental fragmentation
spectra of the unknown peptide, typically obtained using activation or collision-induced
fragmentation in a mass spectrometer. Tandem mass spectrometry (MS/MS) techniques
have been particularly useful. In MS/MS, a peptide is first purified, and then injected into
a first mass spectrometer. This first mass spectrometer serves as a selection device: it
selects a target peptide of a particular molecular mass from a mixture of peptides, and
eliminates most contaminants from the analysis. The target molecule is then activated or
fragmented to form, from the target or parent peptide, a mixture of various peptides of
lower mass that are fragments of the parent. The mixture is then passed through a
second mass spectrometer (i.e., a second mass spectrometry step), generating a fragment
spectrum.

Typically, in the past, the analysis of fragmentation spectra to determine peptide
sequences has involved hypothesizing one or more amino acid sequences based on the
fragmentation spectrum. In certain favorable cases, an expert researcher can interpret the
fragmentation spectra to determine the linear amino acid sequence of an unknown
peptide. The candidate sequences may then be compared with known amino acid
sequences in protein sequence libraries.

In one strategy, the mass of each amino acid is subtracted from the molecular
mass of the parent peptide to determine the possible molecular mass of a fragment,
assuming that each amino acid is in a terminal position. The experimental fragment
spectrum is then examined to determine if a fragment with such a mass is present. A
score is generated for each amino acid, and the scores are sorted to generate a list of
partial sequences for the next subtraction cycle. The subtraction cycle is repeated until
subtraction of the mass of an amino acid leaves a difference of between -0.5 and 0.5,
resulting in one or more candidate amino acid sequences. The highest scoring candidate
sequences are then compared to sequences in a library of known protein sequences in an
attempt to identify a protein having a sub-sequence similar or identical to the candidate
sequence that generated the fragment spectrum.

Although useful in certain contexts, there are difficulties related to hypothesizing
candidate amino acid sequences based on fragmentation spectra. The interpretation of
fragmentation spectra is time-consuming, can generally be performed only in a few
laboratories that have extensive experience with mass spectrometry, and is highly
technical and often inaccurate. Human interpretation is relatively slow, and may be highly
subjective. Moreover, methods based on peptide mass mapping are limited to peptide
masses derived from an intact homogeneous peptide generated by specific, known
proteolytic cleavage, and, thus, are not applicable in general to a mixture of peptides.

U.S. Patent No. 5,538,897 to Yates, III et al. provides a method of correlating
the fragmentation spectrum of an unknown peptide with theoretical spectra calculated
from described peptide sequences stored in a database to match the amino acid sequence
of the unknown peptide to that of a described peptide. Known amino acid sequences,
e.g., in a protein sequence library, are used to calculate or predict one or more candidate
fragment spectra. The predicted fragment spectra are then compared with the
experimentally-obtained fragment spectrum of the unknown protein to determine the best
match or matches. Preferably, the mass of the unknown peptide is known. Sub-sequences
of the various sequences in the protein sequence library are analyzed to identify those
sub-sequences corresponding to a peptide having a mass equal to or within a given
tolerance of the mass of the parent peptide in the fragmentation spectrum. For each
sub-sequence having the proper mass, a predicted fragment spectrum can be calculated
by calculating masses of various amino acid subsets of the candidate peptide. As a result,
a plurality of candidate peptides, each having a predicted fragment spectrum, is obtained.
The predicted fragment spectra are then compared with the fragment spectrum obtained
experimentally for the unknown protein, to identify one or more proteins having
sub-sequences that are likely to be identical to the sequence of peptides that resulted in
the experimentally-derived fragment spectrum. However, this technique cannot be used
to derive the sequence of unknown, novel proteins or peptides having no sequence or
sub-sequence identity with those pre-described or contained in such databases, and, thus,
is not a de novo sequencing method.

Therefore, there remains a need for a true de novo sequencing method of
determining the amino acid sequence of a peptide using mass spectrometry.

SUMMARY OF THE INVENTION 

The present invention is directed to a method for generating a library of peptides,
wherein each peptide in the library has a molecular mass corresponding to the same
predetermined molecular mass. Typically, the library of peptides is then used to determine
the amino acid sequence of an unknown peptide having the predetermined molecular
mass. Preferably, the predetermined molecular mass used to generate the library is the
molecular mass of the unknown peptide. Most preferably, the molecular mass of the
unknown peptide is determined prior to the generation of the library using a mass
spectrometer, such as a time-of-flight mass spectrometer.

The library is synthetic, i.e., not pre-described, and is typically generated each
time a peptide is analyzed, based on the predetermined molecular mass of the unknown
peptide. The library is generated in two steps. First, a set of all allowed combinations of
amino acids that can be present in the unknown peptide is defined, where the molecular
mass of each combination corresponds to the predetermined molecular mass within the
experimental accuracy of the device used to determine the molecular mass, allowing for
water lost in peptide bond formation and for protonation. Second, an allowed library is
generated of all possible permutations of the linear sequence of amino acids in each
combination in the set.

Generally, the present invention is directed to a method for determining the amino
acid sequence of an unknown peptide, which comprises determining a molecular mass
and an experimental fragmentation spectrum for the unknown peptide, comparing the
experimental fragmentation spectrum of the unknown peptide to theoretical
fragmentation spectra calculated for each individual member of an allowed synthetic
peptide library, where the allowed peptide library is of the type described above, and
identifying a peptide in the peptide library having a theoretical fragmentation spectrum
that matches most closely the fragmentation spectrum of the unknown peptide, from
which it is inferred that the amino acid sequence of the identified peptide in the allowed
library represents the amino acid sequence of the unknown peptide.

The molecular mass for the unknown peptide may be determined by any means
known in the art, but is preferably determined with a mass spectrometer. Allowed
combinations of amino acids are chosen from a set of allowed amino acids that typically
comprises the natural amino acids, i.e., tryptophan, arginine, histidine, glutamic acid,
glutamine, aspartic acid, leucine, threonine, proline, alanine, tyrosine, phenylalanine,
methionine, lysine, asparagine, isoleucine, cysteine, valine, serine, and glycine, but may
also include other amino acids, including, but not limited to, non-natural amino acids and
chemically modified derivatives of the natural amino acids, e.g., carbamidocysteine and
deoxymethionine. Allowed combinations of amino acids are then calculated using one or
more individual members of this set of amino acids, allowing for known mass changes
associated with peptide bond formation, such that the total mass of each allowed
combination corresponds to the predetermined mass of the unknown peptide to within
the experimental accuracy to which the molecular mass of the unknown peptide was
determined, typically about 30 ppm. The set of allowed combinations is most easily
calculated using an appropriately programmed computer. The allowed peptide library is
assembled by permuting each allowed amino acid composition into all possible linear
sequences, and is also most easily constructed using an appropriately programmed
computer. It should be noted that the term "allowed" with respect to amino acid
combinations and libraries of peptides refers to combinations and libraries specific to the
unknown peptide under investigation. The peptide library is constructed from the amino
acid combinations, which in turn are calculated from the experimentally determined
molecular mass. As unknown peptides of different mass are investigated, different
combinations of amino acids are allowed, and hence each unknown peptide of unique
molecular mass gives rise to a unique peptide library.

The present invention constrains the allowed library, i.e. limits the number of
possible sequences. In the broadest aspect of the invention, this constraint is achieved
by determining a molecular mass for the peptide whose sequence is to be determined,
i.e. the unknown peptide.

According to preferred embodiments of the invention, information, e.g. available
from the experimental fragmentation spectrum of the unknown peptide, can be used to
put further constraints on the number of possible sequences of amino acids in the peptide
library. For example, the immonium ion region of the mass spectrum used to determine
the molecular mass may also be used to identify amino acids contained in the unknown
peptide. Alternatively or in addition, the two N-terminal amino acids may be identified
from the b2/a2 ion pairs. For example, the two N-terminal amino acids may be deduced
from the prominent signals of the b2 and a2 ions. In particular, the identity of the signals
may be determined by recognition of signals separated by 27.98 a.m.u. (corresponding
to CO) in the region of the spectrum which includes the mass of all possible combinations
of modified and unmodified amino acids. Further, based on the use of enzyme treatment,
e.g. with a protease such as papain, chymotrypsin or trypsin, the C-terminal residue of
any peptide in the spectrum is determined as either arginine or lysine, and this may be
confirmed or identified from the recognition of signals at 175.11 and 147.11, respectively.
Alternatively, C-termini containing basic amino acids can be identified by recognition
of the predicted y1 ion. The spectrum can then be interpreted to identify the next amino
acids.
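
The search for b2/a2 pairs described above can be sketched as a scan for peak pairs whose spacing matches the mass of CO. This is a minimal illustration, not the procedure of the invention: the function name `find_co_pairs`, the 0.02 Da matching tolerance, and the use of the monoisotopic CO mass (27.9949 Da, where the text quotes 27.98 a.m.u.) are all assumptions made for the example.

```python
# Monoisotopic mass of CO in daltons (assumption; the text quotes 27.98 a.m.u.).
CO_MASS = 27.9949


def find_co_pairs(mz_values, tol=0.02):
    """Return (a_ion, b_ion) candidates: peak pairs separated by the mass of CO.

    mz_values: iterable of singly charged fragment m/z values.
    tol: allowed deviation of the spacing from CO_MASS, in daltons (hypothetical).
    """
    peaks = sorted(mz_values)
    pairs = []
    for i, low in enumerate(peaks):
        for high in peaks[i + 1:]:
            delta = high - low
            if delta > CO_MASS + tol:
                break  # peaks are sorted, so later spacings only grow
            if abs(delta - CO_MASS) <= tol:
                pairs.append((low, high))
    return pairs


# Example: the b2 ion of a peptide starting Gly-Ala is at about 129.0658,
# and the matching a2 ion lies one CO below it, at about 101.0709.
example_peaks = [86.0964, 101.0709, 129.0658, 175.1190]
print(find_co_pairs(example_peaks))
```

Each returned pair is only a candidate; the upper mass of a pair would still have to be compared with the b2 masses of all allowed amino acid pairs to name the two N-terminal residues.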

Another means of applying a constraint on the allowed library of amino acids is
to obtain partial internal sequence information, e.g. by identifying the y series of ions with
appropriate defined accuracy of mass measurement. In particular, a computer
programme may be used to recognise at least three sequential signals separated by the
masses of possible modified and unmodified amino acid residues. The differences
between these signals allow identification of a sequence of two amino acids. Most
preferably, the molecular mass of the unknown peptide and at least one other
experimental parameter, e.g. as given above, are used as constraints in initially generating
the library of allowed peptides.
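
As an illustration of recognising three sequential signals separated by residue masses, the sketch below walks a peak list and reports every chain of three peaks whose two spacings each match a residue. It is a sketch under stated assumptions: the function name `find_two_residue_tags`, the five-residue mass table (a subset chosen for brevity, monoisotopic values), and the 100 ppm tolerance applied to the higher peak of each step are illustrative choices, not details taken from the specification.

```python
# Monoisotopic residue masses in daltons; a small subset for illustration only.
RESIDUE_MASS = {
    "G": 57.02146,
    "A": 71.03711,
    "S": 87.03203,
    "Q": 128.05858,
    "K": 128.09496,
}


def find_two_residue_tags(mz_values, residue_mass=RESIDUE_MASS, ppm=100.0):
    """Find chains of three peaks whose two spacings each match a residue mass.

    Returns tuples (p0, p1, p2, tag), where tag lists the two residues in order
    of increasing m/z. For a y-ion series this order runs from the C-terminal
    side toward the N-terminus.
    """
    peaks = sorted(mz_values)

    def steps(low):
        # All (residue, higher_peak) moves available from the peak at `low`.
        found = []
        for high in peaks:
            if high <= low:
                continue
            tol = high * ppm * 1e-6
            for name, mass in residue_mass.items():
                if abs((high - low) - mass) <= tol:
                    found.append((name, high))
        return found

    tags = []
    for p0 in peaks:
        for r1, p1 in steps(p0):
            for r2, p2 in steps(p1):
                tags.append((p0, p1, p2, r1 + r2))
    return tags


# y1, y2 and y3 of a peptide ending in ...Gly-Ala-Lys
y_peaks = [147.112801, 218.149911, 275.171371]
for tag in find_two_residue_tags(y_peaks):
    print(tag)
```

In this example the single tag reads "AG" from low to high mass, which corresponds to Gly-Ala in the usual N-to-C direction.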

The nature of the fragmentation process from which the theoretical fragmentation
spectrum is calculated for every peptide in the allowed library may be of any type known
in the art, such as a mass spectrum or a protease or chemical fragmentation spectrum.
Preferably, both the molecular mass and the fragmentation spectrum for the unknown
peptide are obtained from a tandem mass spectrometer. The amino acid sequence of the
peptide from the allowed library of peptides, having a calculated fragmentation spectrum
that best fits the experimental fragmentation spectrum of the unknown peptide,
corresponds to the amino acid sequence of the unknown peptide.

Although not required, the experimental fragmentation spectrum is generally
normalized. A factor that is an indication of closeness-of-fit between the experimental
fragmentation spectrum of the unknown peptide and each of the theoretical
fragmentation spectra calculated for the peptide library may then be calculated to
determine which of the theoretical fragmentation spectra best fits the experimental
fragmentation spectrum. Preferably, peak values in the fragmentation spectra having an
intensity greater than a predetermined threshold value are selected when calculating the
indication of closeness-of-fit. The theoretical fragmentation spectrum that best fits the
experimental fragmentation spectrum corresponds to the amino acid sequence in the
allowed library that matches that of the unknown peptide.



BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a flow chart of the method of the invention. 

Figs. 2a and 2b are flow charts of alternative preferred embodiments of the
invention.

Fig. 3 is the experimental mass spectrum used to determine the molecular mass 
of unknown Peptide X. 

Fig. 4 is the immonium ion region of the mass spectrum shown in Fig. 3, and 
identifies amino acids contained in unknown Peptide X. 

Fig. 5 is the experimental fragmentation mass spectrum of Peptide X.

Fig. 6 is the experimental mass spectrum used to determine the molecular mass 
of a Peptide Y. 

Fig. 7 is the immonium ion region of the mass spectrum shown in Fig. 6, and 
identifies amino acids contained in Peptide Y. 

Fig. 8 is the experimental tandem mass spectrum of Peptide Y.

Fig. 9 is the experimental mass spectrum used to determine the molecular mass 
of a Peptide Z. 

Fig. 10 is the immonium ion region of the mass spectrum shown in Fig. 9, and
identifies amino acids contained in Peptide Z.

Fig. 11 is the experimental tandem mass spectrum of Peptide Z.

DETAILED DESCRIPTION OF THE INVENTION

The present invention is directed to a de novo method for determining the
sequence of an unknown peptide without reference to any experimentally determined
peptide or nucleotide sequence, and without recourse to a sequential and step-wise
identification and ordering of individual amino acid residues, such as the Edman
degradation process or interpretation of conventional mass spectrometry fragmentation
patterns. In the method of the invention, a library of theoretical peptide sequences is
generated from a predetermined molecular mass, preferably the experimentally
determined molecular mass of an unknown peptide. As such, this library must contain the
amino acid sequence of the unknown peptide, as well as that of any other peptide having
the predetermined molecular mass. The precise amino acid sequence of the unknown is
identified by applying standard correlation functions to select that peptide from the
synthetic library whose calculated, i.e., theoretical, fragmentation spectrum most closely
matches the fragmentation pattern of the unknown. In the preferred embodiment, the
fragmentation spectrum is a mass spectrum and the correlation method is the function
described in U.S. Patent No. 5,538,897, the contents of which are incorporated herein
in their entirety by reference. Preferably, the theoretical fragmentation spectra are
generated and matched to the fragmentation pattern of the unknown using an
appropriately programmed computer.

The invention may be better understood by reference to the flow chart provided
in Fig. 1. Where the peptide is a protein or large polypeptide, the protein or large
polypeptide may be cleaved to form a peptide pool by means well known in the art. The
unknown peptide ("Peptide X") is then separated from the pool by HPLC or any other
means known in the art, preferably mass spectrometry, and the molecular mass of Peptide
X is determined. Although there are a number of methods for determining the molecular
mass of Peptide X, the preferred method is again mass spectrometry.

A set of amino acids that theory or experimental results teach may be included
in Peptide X is then defined for consideration in determining the sequence of Peptide X.
The defined set of amino acids may include modified or unnatural amino acids in addition
to natural amino acids.

Typically, the method of the invention requires a "naked" peptide when
determining the amino acid sequence. Therefore, the peptide should be free of any
individual amino acids that are covalently modified by post-translational modification,
such as, e.g., glycosylation, which involves the attachment of carbohydrate to the side
chain of certain amino acids. Where the method of the invention is used to determine the
amino acid sequence of a post-translationally modified peptide, the modifications are
typically removed from the peptide prior to the analysis, taking due care to leave the
peptide intact. Methods for removing post-translational modifications from peptides are
well known in the art, and include, for example, the removal of N-linked carbohydrates
with enzymes, such as peptide-N-glycosidase F (PNGase F), endo-glycosidases, mixtures
of exo-glycosidases, etc., and the removal of phosphate modification with phosphatases.
In addition, other techniques for removing modifications occasionally found on peptides
are well known in the art. However, where a specific modification to a specific amino
acid is known to be present in the unknown peptide, the modified amino acid may be
included in the defined set of amino acids that theory or experimental results teach may
be included in Peptide X, and, thus, the sequence of the peptide containing the modified
amino acid may be determined with the method of the present invention.

All combinations of amino acids having a total mass equal to the measured mass
of Peptide X are calculated, allowing for water lost in forming peptide links,
protonation, etc. Any individual amino acid may be included as part of any given
combination at any integral stoichiometry up to the amount consistent with the mass
determined for Peptide X. These combinations comprise all of the allowed amino acid
combinations for Peptide X, and, therefore, the actual amino acid
composition of Peptide X will be represented in one and only one of these combinations.
Furthermore, these combinations are generally peptide-specific.
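
The calculation of allowed combinations can be sketched as a depth-first search over residue masses. The sketch below is illustrative only and rests on several assumptions not stated in the text: the measured mass is taken to be that of the singly protonated peptide, the residue table holds standard monoisotopic masses for the twenty natural amino acids, the tolerance defaults to the 30 ppm figure mentioned elsewhere in this specification, and the name `allowed_combinations` is hypothetical.

```python
WATER = 18.010565   # mass of H2O retained by an intact peptide (Da)
PROTON = 1.007276   # mass added by protonation (Da)

# Standard monoisotopic residue masses (Da) for the twenty natural amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}


def allowed_combinations(mh_mass, ppm=30.0, masses=RESIDUE_MASS):
    """List every multiset of residues whose peptide mass matches mh_mass.

    mh_mass: measured mass of the singly protonated peptide (assumption).
    Each result is a tuple of one-letter codes in order of increasing mass.
    """
    target = mh_mass - WATER - PROTON      # sum of residue masses to reach
    tol = mh_mass * ppm * 1e-6             # absolute tolerance in daltons
    items = sorted(masses.items(), key=lambda kv: kv[1])
    results = []

    def extend(start, remaining, chosen):
        if abs(remaining) <= tol:
            results.append(tuple(chosen))
            return
        for i in range(start, len(items)):
            name, mass = items[i]
            if mass > remaining + tol:
                break                      # heavier residues cannot fit either
            chosen.append(name)
            extend(i, remaining - mass, chosen)   # i, not i + 1: repeats allowed
            chosen.pop()

    extend(0, target, [])
    return results


# Example: [M+H]+ of Gly-Ala-Ser; isobaric compositions such as Asn + Thr
# are reported alongside the true one.
for combo in allowed_combinations(234.108441):
    print(combo)
```

Because leucine and isoleucine have identical masses, every composition containing one also appears with the other; such pairs cannot be separated by mass alone.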

An allowed library of linear peptides is then constructed from the allowed
combinations of amino acids. The allowed library is constructed by generating all possible
linear permutations of the sequence of amino acids in each combination, using all the
amino acids in each combination. The allowed library comprises all such permutations of
the amino acids, and therefore must include Peptide X. The allowed library of peptides
having the same molecular mass as Peptide X is typically constructed independently and
ab initio for each new unknown peptide that is sequenced. That is, a new library is
typically constructed as part of each analysis, and for only that analysis. However, as will
be clear to one of ordinary skill in the art, once a library of all peptides having a given
molecular mass has been constructed, that library may be used for the determination of
the amino acid sequence of any other peptide of that particular molecular mass.
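
Generating every linear permutation of one allowed combination can be sketched as follows. The helper name `distinct_permutations` and the use of one-letter residue codes are assumptions made for the example; repeated residues are handled so that each distinct sequence is produced exactly once.

```python
from collections import Counter


def distinct_permutations(composition):
    """Yield each distinct linear sequence that uses all residues in composition.

    composition: iterable of one-letter residue codes, repeats allowed.
    """
    counts = Counter(composition)
    length = sum(counts.values())
    prefix = []

    def walk():
        if len(prefix) == length:
            yield "".join(prefix)
            return
        for residue in sorted(counts):
            if counts[residue] > 0:
                counts[residue] -= 1       # place this residue next
                prefix.append(residue)
                yield from walk()
                prefix.pop()               # undo and try the next residue
                counts[residue] += 1

    yield from walk()


print(list(distinct_permutations("GGT")))
print(len(list(distinct_permutations("GAS"))))
```

The number of sequences grows factorially with peptide length, which is one reason the constraints described in this specification matter in practice.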

This differs fundamentally from existing database approaches in which a single
database of known sequences, which is subject to periodic updates and refinements
based on the availability of experimentally determined sequences, is used for all analyses.
As a result, with the method of the present invention, the determination of new and
previously unknown peptide sequences that are not present in any experimentally
determined peptide sequence library is possible by direct peptide analysis in a
non-step-wise, operator-independent automated process. In addition, the method of the
invention is not constrained to the conventional twenty amino acids, or to their
conventional modifications.

In a preferred embodiment, as shown in the flow chart provided in Fig. 2a,
additional information relating to Peptide X is used to place constraints on the allowed
combinations of amino acids and/or allowed peptide sequences in the library, and, thus,
reduce the number of possible sequences. Useful information related to Peptide X
includes, but is not limited to, partial amino acid composition. For example, the mass
spectrum used to determine the mass of Peptide X may include fragments that can be
used to identify specific amino acids present in Peptide X. Where it is known that certain
amino acids are definitely present in Peptide X, constraints are placed on the allowed
combinations and allowed library by requiring the identified amino acids to be present in
all combinations and, thus, in every peptide present in the library.
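
Requiring identified amino acids to be present in every combination amounts to a simple filter, sketched below. The function name `require_residues` and the tuple representation of a combination are assumptions made for illustration.

```python
from collections import Counter


def require_residues(combinations, required):
    """Keep only the combinations that contain every required residue.

    combinations: iterable of tuples of one-letter residue codes.
    required: residues known to be present (list one twice to require two copies).
    """
    needed = Counter(required)
    kept = []
    for combo in combinations:
        missing = needed - Counter(combo)  # Counter subtraction drops counts <= 0
        if not missing:
            kept.append(combo)
    return kept


candidates = [("G", "A", "S"), ("T", "N"), ("S", "Q")]
print(require_residues(candidates, ["S"]))
```

Filtering at the combination stage, before permutation, keeps the library small because rejected combinations never have their sequences enumerated.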

Fig. 2b illustrates a system whereby more than one constraint is put on the library
of possible linear sequences. By way of illustration only, for each peptide to be analysed
(whether purified or present in a mixture), information on its mass (e.g. by MALDI-MS)
and a tandem mass spectrum from it (e.g. by ESI-tandem MS) are obtained. The tandem
mass spectrum can then be interpreted in an automated manner, to obtain certain
information about the unknown peptide. Suitable software evaluates the following
information, when possible, from a tandem MS spectrum in an automated manner:

i. Information on amino acids contained in the peptide by analysing the
immonium ion region.

ii. Identification of the two N-terminal amino acids by identifying the b2/a2
ion pairs.

iii. Based on the use of trypsin, the C-terminal residue must be lysine or
arginine. These can be identified in the spectrum and the spectrum
interpreted to find the next amino acids.

iv. Partial internal sequence information can be obtained by identifying the y
series of ions with defined accuracy of mass measurement at ≤ 100 ppm.

A discussion of manual spectrum interpretation is provided in Medzihradsky and
Burlingame, A Companion to Methods in Enzymology 6: 284-303 (1994).

Again with reference to Figs. 1 and 2, the allowed library, which has preferably
been constrained, is then used as the basis for generating theoretical fragmentation
patterns that are compared to the experimental fragmentation pattern obtained for
Peptide X. The fragmentation patterns may be obtained by any suitable means known in
the art. Preferably, the fragmentation patterns are mass spectra, and the method used to
match the theoretical and experimental mass spectra is that disclosed in U.S. Patent No.
5,538,897. However, protease or chemical fragmentation, coupled to HPLC separation
of the fragments, may also be used to obtain the experimental fragmentation patterns.

Preferably, in a determination of the amino acid sequence of Peptide X, the
molecular mass of Peptide X is determined with high accuracy, typically, to within about
30 ppm (parts per million). An example of such a spectrum is provided in Fig. 3, where
the molecular mass of Peptide X is determined from the peak at 774.3928 daltons. In
addition, as a result of the partial fragmentation of Peptide X that can occur, fragments
that identify certain amino acids that are contained in Peptide X are also observed,
allowing the peptide library to be constrained. An example of this portion of the mass
spectrum for Peptide X is provided in Fig. 4.
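
To give a sense of scale, a tolerance of 30 ppm on the 774.3928 dalton peak corresponds to roughly ±0.023 daltons. The short sketch below performs that conversion; the helper name `ppm_window` is hypothetical.

```python
def ppm_window(mass, ppm=30.0):
    """Return the (low, high) mass limits for a relative tolerance given in ppm."""
    delta = mass * ppm * 1e-6
    return mass - delta, mass + delta


low, high = ppm_window(774.3928)
print(f"{low:.4f} to {high:.4f} daltons")
```

Any amino acid combination whose calculated mass falls inside this window would be treated as matching the measured mass.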

Peptide X is then subjected to collision-induced dissociation in a mass
spectrometer. The parent peptide and its fragments are then introduced into the second
mass spectrometer that provides an intensity or count and the mass to charge ratio, m/z,
for each of the fragments in the fragment mixture. Each fragment ion is represented in
a bar graph in which the abscissa value is m/z and the ordinate value is the intensity. A
variety of mass spectrometer types can be used, including, but not limited to,
triple-quadrupole mass spectrometry, Fourier-transform cyclotron resonance mass
spectrometry, tandem time-of-flight mass spectrometry, and quadrupole ion trap mass
spectrometry.

The experimental fragment spectrum is then compared to the mass spectra
predicted for the sequences of the allowed library, to identify one or more predicted mass
spectra that closely match the experimental mass spectrum. Because the allowed library
includes all permutations of amino acid sequences that have a total mass corresponding
to that of Peptide X, Peptide X must be represented in the allowed library.

The predicted fragmentation spectra may be obtained and compared to the
experimental fragmentation spectrum by employing a method that involves first
normalizing the experimental fragmentation spectrum. This may be accomplished by
converting the experimental fragmentation spectrum to a list of masses and intensities.
The peak values for Peptide X are removed, and the square root of the remaining
intensity values is calculated, and normalized to a maximum value of 100. The 200 most
intense ions are divided into ten mass regions, and the maximum intensity within each
region is again normalized to 100. Each ion within 3.0 daltons of its neighbour on either
side is given an intensity value equal to the greater of the intensity of the ion or that of
its neighbour. Other normalization methods can be used, and it is possible to perform the
analysis without normalizing. However, in general, normalization is preferred. In
particular, maximum normalized values, the number of intense ions, the number of mass
regions, and the size of the window for assuming the intensity value of a near neighbour
may all be independently varied to larger or smaller values.
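
One possible reading of the normalization steps above is sketched here. Several details are assumptions rather than statements from the text: the parent ion peaks are removed with a fixed ±10 dalton window around the precursor m/z, the ten mass regions are of equal width between the lowest and highest retained peak, and the neighbour rule looks only at the immediately adjacent peak on each side. The function name `normalize_spectrum` and its parameters are hypothetical.

```python
import math


def normalize_spectrum(peaks, precursor_mz, precursor_window=10.0,
                       top_n=200, n_regions=10, neighbour_window=3.0):
    """Normalize a list of (m/z, intensity) peaks following the steps in the text."""
    # 1. Drop the parent ion and take the square root of remaining intensities.
    kept = [(mz, math.sqrt(inten)) for mz, inten in peaks
            if abs(mz - precursor_mz) > precursor_window and inten > 0]
    if not kept:
        return []
    # 2. Scale so that the largest value is 100.
    top = max(inten for _, inten in kept)
    kept = [(mz, 100.0 * inten / top) for mz, inten in kept]
    # 3. Keep the most intense ions, then order them by mass.
    kept = sorted(kept, key=lambda p: p[1], reverse=True)[:top_n]
    kept.sort()
    # 4. Split into equal-width mass regions and rescale each region to 100.
    lowest, highest = kept[0][0], kept[-1][0]
    width = (highest - lowest) / n_regions or 1.0
    regions = {}
    for mz, inten in kept:
        index = min(int((mz - lowest) / width), n_regions - 1)
        regions.setdefault(index, []).append((mz, inten))
    scaled = []
    for index in sorted(regions):
        region_max = max(inten for _, inten in regions[index])
        scaled.extend((mz, 100.0 * inten / region_max)
                      for mz, inten in regions[index])
    # 5. Give each ion the larger of its own and its close neighbours' intensity.
    result = []
    for k, (mz, inten) in enumerate(scaled):
        best = inten
        for j in (k - 1, k + 1):
            if 0 <= j < len(scaled) and abs(scaled[j][0] - mz) <= neighbour_window:
                best = max(best, scaled[j][1])
        result.append((mz, best))
    return result


raw = [(100.0, 400.0), (101.0, 100.0), (500.0, 900.0), (774.4, 10000.0)]
print(normalize_spectrum(raw, precursor_mz=774.4))
```

All of the numeric defaults are exposed as parameters because the text notes that the maximum values, ion count, region count and neighbour window may each be varied.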

A fragment mass spectrum is predicted for each of the candidate sequences. The
fragment mass spectrum is predicted by calculating the fragment ion masses for the type
b and y ions for the amino acid sequence. When a peptide is fragmented and the charge
is retained on the N-terminal cleavage fragment, the resulting ion is labelled as a b-type
ion. If the charge is retained on the C-terminal fragment, it is labelled a y-type ion.
Masses for type b ions are calculated by summing the amino acid masses and adding
the mass of a proton. Masses for type y ions are calculated by summing, from the
C-terminus, the masses of the amino acids and adding the mass of water and a proton to
the initial amino acid. In this way, it is possible to calculate an m/z value for each
fragment.
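
The b and y ion mass calculation described above can be sketched as two running sums. The function name `b_y_ions`, the restriction to singly charged ions, and the small monoisotopic residue table (a subset for brevity) are assumptions made for this example.

```python
WATER = 18.010565   # Da
PROTON = 1.007276   # Da

# Monoisotopic residue masses (Da); a subset for illustration only.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203,
    "K": 128.09496, "R": 156.10111,
}


def b_y_ions(sequence, masses=RESIDUE_MASS):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b ions: residue masses summed from the N-terminus, plus one proton.
    y ions: residue masses summed from the C-terminus, plus water and one proton.
    Fragments of the full-length peptide are omitted.
    """
    residues = [masses[aa] for aa in sequence]

    b_ions, running = [], PROTON
    for mass in residues[:-1]:
        running += mass
        b_ions.append(running)

    y_ions, running = [], WATER + PROTON
    for mass in reversed(residues[1:]):
        running += mass
        y_ions.append(running)

    return b_ions, y_ions


b, y = b_y_ions("GAK")
print("b:", [round(v, 4) for v in b])
print("y:", [round(v, 4) for v in y])
```

For Gly-Ala-Lys the first y ion falls near 147.11, the lysine signal mentioned earlier in this specification.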

However, in order to provide a predicted mass spectrum, it is also necessary to
assign an intensity value for each fragment. Although it is often possible to predict, on
a theoretical basis, an intensity value for each fragment, this procedure is difficult, and
it has been found useful to assign intensities in the following fashion. The value of 50.0
is assigned to each b and y ion. To masses of 1 dalton on either side of the fragment ion,
an intensity of 25.0 is assigned. Peak intensities of 10.0 are assigned at mass peaks 17.0
and 18.0 daltons below the m/z of each b and y ion location, to account for both NH3 and
H2O loss, and peak intensities of 10.0 are assigned to mass peaks 28.0 daltons below
each type b ion location, to account for CO loss.
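
The intensity assignment above might be sketched as follows. Binning m/z values to the nearest integer and keeping the larger intensity when two assignments land in the same bin are assumptions made for this sketch; the text does not say how overlaps are resolved.

```python
def predicted_spectrum(b_ions, y_ions):
    """Build a predicted spectrum as {integer m/z bin: intensity}."""
    spectrum = {}

    def put(mz, intensity):
        key = round(mz)                                   # assumed unit-mass binning
        spectrum[key] = max(spectrum.get(key, 0.0), intensity)

    for mz in list(b_ions) + list(y_ions):
        put(mz, 50.0)          # the fragment ion itself
        put(mz - 1.0, 25.0)    # 1 dalton on either side
        put(mz + 1.0, 25.0)
        put(mz - 17.0, 10.0)   # NH3 loss
        put(mz - 18.0, 10.0)   # H2O loss
    for mz in b_ions:
        put(mz - 28.0, 10.0)   # CO loss, type b ions only
    return spectrum

example = predicted_spectrum([300.0], [500.0])
```

Only the b ion at 300 receives the CO-loss peak at 272; the y ion at 500 has no corresponding entry at 472.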

After calculation of predicted m/z values and assignment of intensities, it is
preferred to calculate a measure of closeness-of-fit between the predicted mass spectra
and the experimentally-derived fragment spectrum. A number of methods for calculating
closeness-of-fit are available. For example, a two-step method may be used that includes
calculating a preliminary closeness-of-fit score, referred to here as $S_p$, and calculating a
correlation function for the highest-scoring amino acid sequences. In the preferred
embodiment, $S_p$ is calculated using the following formula:

$$S_p = \left(\sum i_m\right) \cdot n_i \cdot (1 + \beta) \cdot (1 - \rho) / n_\tau \qquad (1)$$

where $i_m$ are the matched intensities, $n_i$ is the number of matched fragment ions, $\beta$ is the
type b and y ion continuity, $\rho$ is the presence of immonium ions and their respective
amino acids in the predicted sequence, and $n_\tau$ is the total number of fragment ions. The
factor $\beta$ evaluates the continuity of a fragment ion series. If there is a fragment ion match
for the ion immediately preceding the current type b or y ion, $\beta$ is incremented by 0.075
from an initial value of 0.0. This increases the preliminary score for those peptides
matching a successive series of type b and y ions, since extended series of ions of the
same type are often observed in MS/MS spectra. The factor $\rho$ evaluates the presence of
immonium ions in the low mass end of the mass spectrum.
The detection of immonium ions may be used diagnostically to determine the
presence of certain types of amino acids in the sequence. For example, if immonium ions
are present at 110.0, 120.0, or 136.0 ± 1.0 daltons in the processed data file of the
unknown peptide with normalized intensities greater than 40.0, indicating the presence
of histidine, phenylalanine, and tyrosine respectively, then the sequence under evaluation
is checked for the presence of the amino acid indicated by the immonium ion. The
preliminary score, $S_p$, for the peptide is either increased or decreased by a factor of $1 - \rho$,
where $\rho$ is the sum of the penalties for each of the three amino acids whose presence is
indicated in the low mass region. Each individual $\rho$ can take on the value of -0.15 if there
is a corresponding low mass peak, and the amino acid is not present in the sequence,
+0.15 if there is a corresponding low mass peak and the amino acid is present in the
sequence, or 0.0 if the low mass peak is not present. The total penalty can range from
-0.45, where all three low mass peaks are present in the spectrum but the corresponding
amino acids are not present in the sequence, to +0.45, where all three low mass peaks are
present in the spectrum and the corresponding amino acids are present in the sequence.
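
A minimal sketch of the preliminary score follows, using the continuity increment of 0.075, the immonium values of ±0.15, and the form of equation (1) exactly as they are stated in the text. The function names and data layout are assumptions made here, as is the reading that the continuity increment applies only when both the current ion and the one preceding it are matched.

```python
IMMONIUM_MARKERS = {110.0: "H", 120.0: "F", 136.0: "Y"}

def immonium_factor(peaks, sequence, threshold=40.0, tol=1.0):
    """peaks: list of (m/z, normalized intensity). Returns rho as defined in the text."""
    rho = 0.0
    for marker_mz, residue in IMMONIUM_MARKERS.items():
        observed = any(abs(mz - marker_mz) <= tol and inten > threshold
                       for mz, inten in peaks)
        if observed:
            rho += 0.15 if residue in sequence else -0.15
    return rho

def continuity(matched_flags):
    """matched_flags: booleans for one ion series, in order. Returns beta."""
    beta = 0.0
    for previous, current in zip(matched_flags, matched_flags[1:]):
        if previous and current:      # assumed: both ions must be matched
            beta += 0.075
    return beta

def preliminary_score(matched_intensities, n_total, beta, rho):
    """Equation (1): (sum of matched intensities) * n_matched * (1+beta) * (1-rho) / n_total."""
    n_matched = len(matched_intensities)
    return sum(matched_intensities) * n_matched * (1.0 + beta) * (1.0 - rho) / n_total
```

Keeping the two factors as separate functions makes it straightforward to swap in a different continuity or immonium rule without touching the score itself.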

Following the calculation of the preliminary closeness-of-fit score, $S_p$, the
predicted mass spectra having the highest $S_p$ scores are selected for further analysis using
the correlation function. The number of candidate predicted mass spectra that are
selected for further analysis will depend largely on the computational resources and time
available.

For purposes of calculating the correlation function, the experimentally-derived
fragment spectrum is typically preprocessed in a fashion somewhat different from
preprocessing employed before calculating $S_p$. For purposes of the correlation function,
the precursor ion is removed from the spectrum, and the spectrum is divided into 10
sections. Ions in each section are then normalized to 50.0. The section-wise normalized
spectra are then used for calculating the correlation function. The discrete correlation
between the two functions may be calculated as:

$$R_\tau = \sum_{i=0}^{n-1} x[i]\, y[i+\tau] \qquad (2)$$
where $\tau$ is a lag value. The discrete correlation theorem states that the discrete correlation
of two real functions x and y is one member of the discrete Fourier transform pair

$$R_\tau \Leftrightarrow X_\tau Y^*_\tau \qquad (3)$$

where $X_\tau$ and $Y_\tau$ are the discrete Fourier transforms of $x[i]$ and $y[i]$, and $Y^*$
denotes complex conjugation. Therefore, the cross-correlations can be computed by
Fourier transformation of the two data sets using the fast Fourier transform (FFT)
algorithm, multiplication of one transform by the complex conjugate of the other, and
inverse transformation of the resulting product.

The predicted spectra as well as the pre-processed unknown spectrum may be
zero-padded to 4096 data points. Padding is used because the MS/MS spectra are not
periodic, as assumed by the correlation theorem, and because the FFT algorithm requires
N to be an integer power of two, so the resulting end effects need to be considered. The
final score attributed to each candidate peptide sequence is R(0) minus the mean of the
cross-correlation function over the range $-75 < \tau < 75$. This modified "correlation
parameter", described in Powell and Hieftje, Anal. Chim. Acta, 100:313-327 (1978), shows
better discrimination than the spectral correlation coefficient R(0) alone. The raw scores
are normalized to 1.0. Preferably, the output includes the normalized raw score, the
candidate peptide mass, the unnormalized correlation coefficient, the preliminary score,
the fragment ion continuity $\beta$, the immonium ion factor $\rho$, the number of type b and y
ions matched out of the total number of fragment ions, their matched intensities, the
protein accession number, and the candidate peptide sequence.
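
The FFT route to the cross-correlation and the final score can be sketched with the Python standard library alone. The small radix-2 FFT below is included only to keep the sketch self-contained; all function names are assumptions. The code places the complex conjugate on the transform of x so that the result matches the lag convention of equation (2); placing it on the other transform merely reverses the sign of the lag, which leaves R(0) and the mean over a symmetric lag window unchanged.

```python
import cmath

def fft(values, inverse=False):
    """Recursive radix-2 FFT; len(values) must be a power of two. No scaling applied."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out

def cross_correlation(x, y, size=4096):
    """Return r with r[t] = sum_i x[i] * y[i + t]; negative lags are r[-t]."""
    xp = list(x) + [0.0] * (size - len(x))     # zero-pad to a power of two
    yp = list(y) + [0.0] * (size - len(y))
    fx, fy = fft(xp), fft(yp)
    product = [a.conjugate() * b for a, b in zip(fx, fy)]
    return [v.real / size for v in fft(product, inverse=True)]

def correlation_score(x, y, size=4096, window=75):
    """R(0) minus the mean of R over lags -window < t < window."""
    r = cross_correlation(x, y, size)
    lags = range(-window + 1, window)
    background = sum(r[t] for t in lags) / len(lags)
    return r[0] - background
```

With spectra binned at unit mass, `x` and `y` are simply intensity lists indexed by m/z; `size` should be at least twice the longest list so that the padded zeros absorb the end effects.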

The correlation function can be used to select automatically one of the predicted
mass spectra as corresponding to the experimentally-derived fragment spectrum.
Preferably, however, a number of sequences from the library are output and final
selection of a single sequence is done by a skilled operator.

Depending on the computing and time resources available, it may be
advantageous to employ data-reduction techniques. Preferably, these techniques will
emphasize the most informative ions in the spectrum while not unduly affecting search
speed. One technique involves considering only some of the fragment ions in the MS/MS
spectrum, which, for a peptide, may contain as many as 3,000 fragment ions. According
to one data reduction strategy, the ions are ranked by intensity, and some fraction of the
most intense ions is used for comparison. Another approach involves subdividing the
spectrum into a small number of regions, e.g., about 5, and using the 50 most intense ions
in each region as part of the data set. Yet another approach involves selecting ions based
on the probability of those ions being sequence ions. For example, ions could be selected
which exist in mass windows of 57 through 186 daltons, i.e., the range of mass
increments for the 20 common amino acids from glycine to tryptophan that contain
diagnostic features of type b or y ions, such as losses of 17 or 18 daltons, corresponding
to ammonia and water, or a loss of 28 daltons, corresponding to CO.

A number of different scoring algorithms can be used for determining preliminary
closeness-of-fit or correlation. In addition to scoring based on the number of matched
ions multiplied by the sum of the intensity, scoring can be based on the percentage of
continuous sequence coverage represented by the sequence ions in the spectrum. For
example, a 10 residue peptide will potentially contain 9 each of b and y type sequence
ions. If a set of ions extends from b1 to b9, then a score of 100 is awarded, but if a
discontinuity is observed in the middle of the sequence, such as a missing b5 ion, a
penalty is assessed. The maximum score is awarded for an amino acid sequence that
contains a continuous ion series in both the b and y directions.

In the event that the described scoring procedures do not delineate an answer, an
additional technique for spectral comparison can be used in which the database is initially
searched with a molecular weight value and a reduced set of fragment ions. Initial
filtering of the database occurs by matching sequence ions, and generating a score with
one of the methods described above. The resulting set of answers will then undergo a
more rigorous inspection process using a modified full MS/MS spectrum.

For the second stage analysis, one of several spectral matching approaches
developed for spectral library searching is used. This will require generating a "library
spectrum" for the peptide sequence, based on the sequence ions predicted for that amino
acid sequence. Intensity values for sequence ions of the "library spectrum" will be
obtained from the experimental spectrum. If a fragment ion is predicted at m/z 256, then
the intensity value for the ion in the experimental spectrum at m/z 256 will be used as the
intensity of the ion in the predicted spectrum. Thus, if the predicted spectrum is identical
to the "unknown" spectrum, it will represent an ideal spectrum. The spectra will then be
compared using a correlation function. In general, it is believed that the majority of
computational time for the above procedure is spent in the iterative search process. By
multiplexing the analysis of multiple MS/MS spectra in one pass through the database,
an overall improvement in efficiency will be realized. In addition, the mass tolerance used
in the initial pre-filtering can affect search times by increasing or decreasing the number
of sequences to analyze in subsequent steps.

Another approach to speeding up searches involves a binary encoding scheme
where the mass spectrum is encoded as peak/no peak at every mass depending on
whether the peak is above a certain threshold value. If intensive use of a protein sequence
library is contemplated, it may be possible to calculate and store predicted mass values
of all sub-sequences within a predetermined range of masses so that at least some of the
analysis can be performed by table look-up rather than calculation.

EXAMPLES 

The following non-limiting examples are merely illustrative of the preferred
embodiments of the present invention, and are not to be construed as limiting the
invention, the scope of which is defined by the appended claims.

EXAMPLE 1.

The amino acid sequence of unknown Peptide X was determined using the
method of the invention. The molecular mass of Peptide X was first determined using a
matrix-assisted laser-desorption time-of-flight mass spectrometer (Voyager Elite,
manufactured by PerSeptive Biosystems) with delayed extraction and post source decay.
As shown in Fig. 3, the mass of the protonated form of Peptide X is 774.3928
daltons, which indicates a mass of 773.3928 daltons for Peptide X.

The set of amino acids that are possibly part of Peptide X was then defined for
consideration in the analysis. The defined set of amino acids with the molecular mass of
each amino acid less the mass of the one water molecule lost during peptide bond
formation is provided below. The molecular masses are given in daltons or a.m.u.



tryptophan          = 186.079313
carbamide cysteine  = 160.03065
arginine            = 156.10111
phenylalanine       = 147.068414
histidine           = 137.058912
methionine          = 131.04085
glutamic acid       = 129.042593
lysine              = 128.094963
glutamine           = 128.058577
asparagine          = 114.042927
aspartic acid       = 115.026943
isoleucine          = 113.084064
leucine             = 113.084064
cysteine            = 103.009185
threonine           = 101.047678
valine              =  99.068414
proline             =  97.052764
serine              =  87.032028
alanine             =  71.037114
glycine             =  57.021464
tyrosine            = 163.063328







The allowed combinations of amino acids for Peptide X were determined by first
determining the molecular mass of Peptide X, as described above, to an experimental
accuracy of 30 ppm (parts per million). Therefore, each allowed combination of amino
acids in the allowed library must have a total mass of 773.3928 ± 30 ppm. In addition to
providing the molecular mass of Peptide X, the first mass spectrum also confirmed the
presence of certain amino acids in Peptide X. The immonium region of this mass
spectrum, which shows the presence of these amino acids, is given in Fig. 4. In particular,
the immonium region of the spectrum indicates the presence of arginine with a
characteristic mass of 174.988, leucine/isoleucine with a characteristic mass at 85.8851
(these amino acids have the same mass, and are therefore not distinguishable by mass
alone), histidine with a characteristic mass at 109.823, and tyrosine with a characteristic
mass at 135.915. Therefore, it was possible to constrain the allowed library to sets
containing arginine, leucine/isoleucine, histidine, and tyrosine, having a total molecular
mass of 773.3928 ± 30 ppm.

To determine the sets of amino acids that have a total molecular mass of
773.3928 ± 30 ppm, the following equation was applied:

$$MM_X = \sum \big[ (\text{histidine}) + (\text{tyrosine}) + (\text{leucine/isoleucine}) + (\text{arginine}) + (\mathrm{H_2O}) + (aa_1) + \cdots + (aa_n) \big],$$

where $aa_1, \ldots, aa_n$ are any of the allowed amino acids, other than arginine,
leucine/isoleucine, histidine, and tyrosine. The only combinations of amino acids that can
have a total molecular mass of 773.3928 ± 30 ppm are as follows:

1) tryptophan, arginine, leucine/isoleucine, histidine, and tyrosine. 

2) glutamic acid, glycine, arginine, leucine/isoleucine, histidine, and 
tyrosine. 

3) alanine, aspartic acid, arginine, leucine/isoleucine, histidine, and 
tyrosine. 

These combinations constitute the allowed sets of amino acids for Peptide X. 
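
The search for allowed combinations can be illustrated with a short exhaustive enumeration. The sketch below uses the residue masses listed above (leucine and isoleucine merged into one entry, and `C*` standing for the modified cysteine); the water mass, the function name, and the depth-first strategy are assumptions made here and are not taken from the text.

```python
# Residue masses (daltons) from the table above; "L/I" covers leucine and isoleucine.
RESIDUE_MASS = {
    "G": 57.021464, "A": 71.037114, "S": 87.032028, "P": 97.052764,
    "V": 99.068414, "T": 101.047678, "C": 103.009185, "L/I": 113.084064,
    "N": 114.042927, "D": 115.026943, "Q": 128.058577, "K": 128.094963,
    "E": 129.042593, "M": 131.04085, "H": 137.058912, "F": 147.068414,
    "R": 156.10111, "C*": 160.03065, "Y": 163.063328, "W": 186.079313,
}
WATER = 18.010565  # assumed monoisotopic mass of H2O

def combinations_for_mass(target, ppm, required, residues=RESIDUE_MASS):
    """Return every multiset of extra residues that, together with the
    required residues and one water, matches target within the ppm window."""
    tolerance = target * ppm * 1e-6
    base = WATER + sum(residues[r] for r in required)
    names = sorted(residues, key=residues.get)      # ascending mass allows pruning
    results = []

    def extend(start, chosen, total):
        if abs(total - target) <= tolerance:
            results.append(tuple(chosen))
        for i in range(start, len(names)):
            mass = residues[names[i]]
            if total + mass > target + tolerance:
                break                                # heavier residues overshoot too
            extend(i, chosen + [names[i]], total + mass)

    extend(0, [], base)
    return results

found = combinations_for_mass(773.3928, 30, ["H", "Y", "L/I", "R"])
```

With the masses as tabulated, the extra residues found for Peptide X are tryptophan alone, glutamic acid with glycine, and alanine with aspartic acid, in agreement with the three combinations listed above.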

In addition, Peptide X was obtained by a tryptic cleavage, and, therefore, from 
the accepted specificity of trypsin, Peptide X must also have lysine or arginine as its 
carboxy terminal amino acid. With this constraint, the allowed library of linear peptides 
was constructed from all individual linear permutations of combinations 1, 2, and 3. 
The allowed library includes 528 linear peptides, one set of 264 peptides containing 
isoleucine (shown below) and a corresponding set of 264 peptides in which isoleucine is 
replaced by leucine (not shown). 
  1) YIHWR     2) IYHWR     3) YHIWR     4) HYIWR     5) IHYWR     6) HIYWR
  7) YIWHR     8) IYWHR     9) YWIHR    10) WYIHR    11) IWYHR    12) WIYHR
 13) YHWIR    14) HYWIR    15) YWHIR    16) WYHIR    17) HWYIR    18) WHYIR
 19) IHWYR    20) HIWYR    21) IWHYR    22) WIHYR    23) HWIYR    24) WHIYR
 25) YIHEGR   26) IYHEGR   27) YHIEGR   28) HYIEGR   29) IHYEGR   30) HIYEGR
 31) YIEHGR   32) IYEHGR   33) YEIHGR   34) EYIHGR   35) IEYHGR   36) EIYHGR
 37) YHEIGR   38) HYEIGR   39) YEHIGR   40) EYHIGR   41) HEYIGR   42) EHYIGR
 43) IHEYGR   44) HIEYGR   45) IEHYGR   46) EIHYGR   47) HEIYGR   48) EHIYGR
 49) YIHGER   50) IYHGER   51) YHIGER   52) HYIGER   53) IHYGER   54) HIYGER
 55) YIGHER   56) IYGHER   57) YGIHER   58) GYIHER   59) IGYHER   60) GIYHER
 61) YHGIER   62) HYGIER   63) YGHIER   64) GYHIER   65) HGYIER   66) GHYIER
 67) IHGYER   68) HIGYER   69) IGHYER   70) GIHYER   71) HGIYER   72) GHIYER
 73) YIEGHR   74) IYEGHR   75) YEIGHR   76) EYIGHR   77) IEYGHR   78) EIYGHR
 79) YIGEHR   80) IYGEHR   81) YGIEHR   82) GYIEHR   83) IGYEHR   84) GIYEHR
 85) YEGIHR   86) EYGIHR   87) YGEIHR   88) GYEIHR   89) EGYIHR   90) GEYIHR
 91) IEGYHR   92) EIGYHR   93) IGEYHR   94) GIEYHR   95) EGIYHR   96) GEIYHR
 97) YHEGIR   98) HYEGIR   99) YEHGIR  100) EYHGIR  101) HEYGIR  102) EHYGIR
103) YHGEIR  104) HYGEIR  105) YGHEIR  106) GYHEIR  107) HGYEIR  108) GHYEIR
109) YEGHIR  110) EYGHIR  111) YGEHIR  112) GYEHIR  113) EGYHIR  114) GEYHIR
115) HEGYIR  116) EHGYIR  117) HGEYIR  118) GHEYIR  119) EGHYIR  120) GEHYIR
121) IHEGYR  122) HIEGYR  123) IEHGYR  124) EIHGYR  125) HEIGYR  126) EHIGYR
127) IHGEYR  128) HIGEYR  129) IGHEYR  130) GIHEYR  131) HGIEYR  132) GHIEYR
133) IEGHYR  134) EIGHYR  135) IGEHYR  136) GIEHYR  137) EGIHYR  138) GEIHYR
139) HEGIYR  140) EHGIYR  141) HGEIYR  142) GHEIYR  143) EGHIYR  144) GEHIYR
145) YIHDAR  146) IYHDAR  147) YHIDAR  148) HYIDAR  149) IHYDAR  150) HIYDAR
151) YIDHAR  152) IYDHAR  153) YDIHAR  154) DYIHAR  155) IDYHAR  156) DIYHAR
157) YHDIAR  158) HYDIAR  159) YDHIAR  160) DYHIAR  161) HDYIAR  162) DHYIAR
163) IHDYAR  164) HIDYAR  165) IDHYAR  166) DIHYAR  167) HDIYAR  168) DHIYAR
169) YIHADR  170) IYHADR  171) YHIADR  172) HYIADR  173) IHYADR  174) HIYADR
175) YIAHDR  176) IYAHDR  177) YAIHDR  178) AYIHDR  179) IAYHDR  180) AIYHDR
181) YHAIDR  182) HYAIDR  183) YAHIDR  184) AYHIDR  185) HAYIDR  186) AHYIDR
187) IHAYDR  188) HIAYDR  189) IAHYDR  190) AIHYDR  191) HAIYDR  192) AHIYDR
193) YIDAHR  194) IYDAHR  195) YDIAHR  196) DYIAHR  197) IDYAHR  198) DIYAHR
199) YIADHR  200) IYADHR  201) YAIDHR  202) AYIDHR  203) IAYDHR  204) AIYDHR
205) YDAIHR  206) DYAIHR  207) YADIHR  208) AYDIHR  209) DAYIHR  210) ADYIHR
211) IDAYHR  212) DIAYHR  213) IADYHR  214) AIDYHR  215) DAIYHR  216) ADIYHR
217) YHDAIR  218) HYDAIR  219) YDHAIR  220) DYHAIR  221) HDYAIR  222) DHYAIR
223) YHADIR  224) HYADIR  225) YAHDIR  226) AYHDIR  227) HAYDIR  228) AHYDIR
229) YDAHIR  230) DYAHIR  231) YADHIR  232) AYDHIR  233) DAYHIR  234) ADYHIR
235) HDAYIR  236) DHAYIR  237) HADYIR  238) AHDYIR  239) DAHYIR  240) ADHYIR
241) IHDAYR  242) HIDAYR  243) IDHAYR  244) DIHAYR  245) HDIAYR  246) DHIAYR
247) IHADYR  248) HIADYR  249) IAHDYR  250) AIHDYR  251) HAIDYR  252) AHIDYR
253) IDAHYR  254) DIAHYR  255) IADHYR  256) AIDHYR  257) DAIHYR  258) ADIHYR
259) HDAIYR  260) DHAIYR  261) HADIYR  262) AHDIYR  263) DAHIYR  264) ADHIYR
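
The construction of the allowed library for Peptide X, with arginine fixed at the carboxy terminus, can be reproduced with a few lines of Python. The function name and the single-letter representation of each combination are assumptions for this sketch; isoleucine stands in for leucine/isoleucine, as in the list above.

```python
from itertools import permutations

def tryptic_library(combination, c_terminal=("K", "R")):
    """All distinct linear orderings of `combination` that end in K or R."""
    sequences = set()
    for end in set(combination) & set(c_terminal):
        rest = list(combination)
        rest.remove(end)                       # fix one copy at the C-terminus
        for order in permutations(rest):
            sequences.add("".join(order) + end)
    return sorted(sequences)

# The three allowed combinations for Peptide X, in single-letter code.
combinations = ["WRIHY", "EGRIHY", "ADRIHY"]
library = [seq for combo in combinations for seq in tryptic_library(combo)]
```

The three combinations contribute 24, 120 and 120 sequences respectively, giving the 264 isoleucine-containing peptides; replacing I by L doubles the count to 528. Collecting sequences in a set also removes duplicates when a combination contains a repeated residue.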



The method of U.S. Patent No. 5,538,897 was then used to match Peptide X to
this library by MS/MS. The experimental tandem mass spectrum of Peptide X is shown
in Fig. 5, and the 10 top ranking peptides matched to this spectrum are provided below.
It was determined that the sequence of Peptide X is that of the top ranked peptide,
AHYDIR.



Rank/Sp   (M+H)   Cn       C*10^4   Sp      Ions    Reference   Peptide
1/1       774.9   1.0000   1.8118   491.0   11/15   p(228)      (-)AHYDIR
2/3       774.9   0.9308   1.6864   386.2   10/15   p(238)      (-)AHDYIR
3/2       774.9   0.8012   1.4516   414.3   10/15   p(227)      (-)HAYDIR
4/5       774.9   0.7319   1.3262   320.5   9/15    p(237)      (-)HADYIR
5/1       774.9   0.7168   1.2987   491.0   11/15   p(186)      (-)AHYIDR
6/12      774.9   0.6131   1.1108   248.3   9/15    p(226)      (-)AYHDIR
7/3       774.9   0.6033   1.0930   386.2   10/15   p(192)      (-)AHIYDR
8/9       774.9   0.5878   1.0651   264.1   9/15    p(225)      (-)YAHDIR
9/50      774.9   0.5850   1.0599   156.5   7/15    p(219)      (-)YDHAIR
10/14     774.9   0.5825   1.0553   247.9   9/15    p(217)      (-)YHDAIR
EXAMPLE 2.

The amino acid sequence of Peptide Y, a known, standard peptide, was
determined using the method of the invention, as applied to Peptide X in Example 1.
Peptide Y has the following amino acid sequence: YGGFIRR. The molecular mass of
Peptide Y was determined to be 868.4719 to an experimental accuracy of 30 ppm from
the mass spectrum shown in Fig. 6. The masses at 1296.6854 and 1570.6774 are from
internal standards, added to allow instrument calibration.

The set of amino acids that are possibly part of Peptide Y was then defined for
consideration in the analysis. The defined set of amino acids with the molecular mass of
each amino acid less the mass of the one water molecule lost during peptide bond
formation is the same as that used in Example 1.

As the mass of Peptide Y was measured as 868.4719 to an experimental accuracy
of ± 30 ppm, each allowed amino acid combination must therefore have a total mass
equal to 868.4719 ± 30 ppm. In addition, from the immonium ion region of the PSD
trace from Fig. 6, shown in Fig. 7, it was determined that Peptide Y must also contain
the following amino acids: tyrosine with a characteristic mass at 136.027, phenylalanine
with a characteristic mass at 120.071, arginine with a characteristic mass at 175.00, and
leucine or isoleucine with a characteristic mass at 85.9225.

Application of the equation in Example 1 demonstrated that only the following
combinations of amino acids are allowed for Peptide Y:

1) Tyrosine, phenylalanine, arginine, leucine/isoleucine, asparagine, and arginine.

2) Tyrosine, phenylalanine, arginine, arginine, leucine/isoleucine, glycine,
and glycine.

3) Tyrosine, phenylalanine, arginine, leucine/isoleucine, alanine, alanine, and
glutamine.

4) Tyrosine, phenylalanine, arginine, leucine/isoleucine, glycine, valine, and
asparagine.

5) Tyrosine, phenylalanine, arginine, leucine/isoleucine, glycine, glycine,
glycine, and valine.

6) Tyrosine, phenylalanine, arginine, leucine/isoleucine, glycine, alanine,
alanine, and alanine.

These combinations constitute the allowed set of amino acid combinations for Peptide
Y.

In addition, Peptide Y was obtained by a tryptic cleavage, and, thus, from the
accepted specificity of trypsin, Peptide Y must also have lysine or arginine as its carboxy
terminal amino acid. With this constraint, the allowed library of linear peptides for
Peptide Y is constructed from all individual linear permutations of the combinations
above. The allowed library includes over 20,000 peptides, and is thus not shown.

As with Example 1, the method of U.S. Patent No. 5,538,897 was then used to
match Peptide Y to this library by tandem mass spectrometry. The experimental tandem
mass spectrum of Peptide Y is shown in Fig. 8, and the top 10 ranking peptides matched
to this spectrum are given below. Of these ten, the top ranking peptide, YGGFIRR, is
known to be Peptide Y.



Rank/Sp   (M+H)   Cn      C*10^4   Sp      Ions    Reference   Peptide
1/3       868.5   1.000   1.894    376.6   11/24   p(415)      (-)YGGFIRR
2/1       868.5   0.967   1.831    440.4   11/24   p(298)      (-)YGGRIFR
3/15      868.5   0.966   1.830    322.8   11/28   p(1975)     (-)YGGFIGVR
4/15      868.5   0.965   1.828    322.8   11/28   p(1735)     (-)YGGFIVGR
5/5       868.5   0.961   1.821    361.7   11/24   p(454)      (-)YGGRFIR
6/2       868.5   0.960   1.819    408.0   11/24   p(1311)     (-)YGVNIFR
7/12      868.5   0.951   1.802    333.7   11/24   p(1527)     (-)YGVNFIR
8/8       868.5   0.942   1.783    356.9   11/28   p(2153)     (-)YGGGVIFR
9/13      868.5   0.937   1.775    331.0   11/24   p(394)      (-)YGGIFRR
10/8      868.5   0.935   1.771    356.9   11/28   p(2147)     (-)YGGVGIFR



EXAMPLE 3. 

The amino acid sequence of Peptide Z, a known standard peptide, was
determined using the method of the invention, as applied to Peptide X in Example 1 and
Peptide Y in Example 2. Peptide Z has the following amino acid sequence: RPPGFSPFR.
The molecular mass of Peptide Z was determined to be 1060.5660 to an experimental
accuracy of 30 ppm from the mass spectrum shown in Fig. 9. The masses at 1181.6477,
1296.6933 and 1570.6774 are from internal standards, added to allow instrument
calibration.

The set of amino acids that are possibly part of Peptide Z was then defined for
consideration in the analysis. The defined set of amino acids with the molecular mass of
each amino acid less the mass of the one water molecule lost during peptide bond
formation is the same as that used in Examples 1 and 2.

As the mass of Peptide Z was measured as 1060.5660 to an experimental
accuracy of 30 ppm, each allowed amino acid combination must therefore sum to a mass
equal to 1060.5660 ± 30 ppm. In addition, from the immonium ion region of the PSD
trace from Fig. 9, shown in Fig. 10, it was determined that Peptide Z must also contain
the following amino acids: phenylalanine with a characteristic mass at 120.20, arginine
with a characteristic mass at 174.94, serine together with proline as deduced from the
mass at 167.23, and glycine together with proline as deduced from the mass at 155.66.

The equation in Example 1 was applied to determine the allowed combinations of
amino acids for Peptide Z, and demonstrates that only the following combinations of
amino acids are allowed for Peptide Z:



PTIW + FRPSG      WIW + FRPSG       GQRR + FRPSG      ANRR + FRPSG
GGARR + FRPSG     PPFR + FRPSG      PIMR + FRPSG      VIER + FRPSG
VNQR + FRPSG      CGVQR + FRPSG     GAAAAR + FRPSG    GWNK + FRPSG
GPPVF + FRPSG     ASPNK + FRPSG     GPVIM + FRPSG     APVVM + FRPSG
AAIIE + FRPSG     GPTDS + FRPSG     GWIE + FRPSG      ASPIE + FRPSG
APVTE + FRPSG     AVWE + FRPSG      AWID + FRPSG      SPWD + FRPSG
GGAAIK + FRPSG    GVTNN + FRPSG     GGGPTK + FRPSG    AWNN + FRPSG
GGGWK + FRPSG     GGGVIN + FRPSG    GAAAVK + FRPSG    GAAAIN + FRPSG
GGASPK + FRPSG    GGAWN + FRPSG     IQQQ + FRPSG      GAIQQ + FRPSG
AAVQQ + FRPSG     AAAAVN + FRPSG    SSPII + FRPSG     SPVTI + FRPSG
AAAQR + FRPSG     GSPKK + FRPSG     AAINQ + FRPSG     GGGGGVI + FRPSG
IIDR + FRPSG      IQQK + FRPSG      GWNQ + FRPSG      GGGAAAAI + FRPSG
INM + FRPSG       GAIQK + FRPSG     GGAAIQ + FRPSG    PPTTT + FRPSG
GGENR + FRPSG     AAVQK + FRPSG     GGGWQ + FRPSG     PWTT + FRPSG
GAVNR + FRPSG     GSPQK + FRPSG     AAAVQ + FRPSG     GGGGAW + FRPSG
GGGGIR + FRPSG    AAINY + FRPSG     GVIID + FRPSG     GGAAAAV + FRPSG
GGGAVR + FRPSG    GPTNK + FRPSG     APTID + FRPSG     AAAAAAA + FRPSG



These combinations constitute the allowed set of amino acid combinations for Peptide
Z.

In addition, Peptide Z was obtained by a tryptic cleavage, and, from the accepted
specificity of trypsin, Peptide Z must have lysine or arginine as its carboxy terminal amino
acid. With this constraint, the allowed library of linear peptides for Peptide Z is
constructed from all individual linear permutations of the combinations above. The
allowed library includes over 2,000,000 peptides, and is thus not shown.

As with Examples 1 and 2, the method of U.S. Patent No. 5,538,897 was then
used to match Peptide Z to this library by tandem mass spectrometry. The experimental
tandem mass spectrum of Peptide Z is shown in Fig. 11, and the top 10 ranking peptides
matched to this spectrum are provided below. Of these ten, the top ranking peptide,
RPPGFSPFR, is known to be Peptide Z.



Rank/Sp   (M+H)    Cn      C*10^4   Sp       Ions    Reference   Peptide
1/1       1061.2   1.000   3.310    1163.5   19/24   p(135)      (-)RPPGFSPFR
2/2       1061.2   0.871   2.884    1126.6   19/24   p(120)      (-)RPPGFPSFR
3/5       1061.2   0.857   2.835    824.7    17/24   p(122)      (-)RPPFGPSFR
4/11      1061.2   0.849   2.811    692.8    16/24   p(164)      (-)RPPGFFPSR
5/4       1061.2   0.833   2.759    831.2    17/24   p(189)      (-)RPPGFFSPR
6/3       1061.2   0.831   2.749    872.9    17/24   p(131)      (-)RPPSFGPFR
7/6       1061.2   0.819   2.711    797.1    17/24   p(126)      (-)RPFGPPSFR
8/12      1061.2   0.806   2.668    674.0    16/24   p(100)      (-)RPPGPSFFR
9/13      1061.2   0.792   2.623    668.4    16/24   p(137)      (-)RFPPGSPFR
10/14     1061.2   0.782   2.588    656.5    16/24   p(138)      (-)RFGPPSPFR

While it is apparent that the invention disclosed herein is well calculated to fulfill
the objectives stated above, it will be appreciated that numerous modifications and
embodiments may be devised by those skilled in the art. Therefore, it is intended that the
appended claims cover all such modifications and embodiments that fall within the true
spirit and scope of the present invention.
CLAIMS

1. A method for determining the amino acid sequence of an unknown peptide, which
comprises:

(a) determining a molecular mass and an experimental fragmentation
spectrum for the unknown peptide;

(b) comparing the experimental fragmentation spectrum of the unknown
peptide to theoretical fragmentation spectra calculated for a peptide library composed of
all possible linear sequences of amino acids having a total mass that corresponds to the
molecular mass of the unknown peptide; and

(c) identifying a peptide in the peptide library having a theoretical
fragmentation spectrum that matches most closely the fragmentation spectrum of the
unknown peptide.

2. The method of claim 1, wherein the molecular mass for the unknown peptide is
determined with an accuracy of up to about 30 parts per million.

3. The method of claim 1 or claim 2, wherein the total mass of each of the possible
linear sequences of amino acids is within the range of plus or minus about 30 parts per
million of the molecular mass of the unknown peptide.

4. The method of any preceding claim, further comprising calculating an indication
of closeness-of-fit between the experimental fragmentation spectrum of the unknown
peptide and each of the theoretical fragmentation spectra calculated for the peptide
library.

5. The method of claim 4, further comprising selecting peak values having an
intensity greater than a predetermined threshold value when calculating the indication of
closeness-of-fit.

6. The method of any preceding claim, further comprising normalizing the
experimental fragmentation spectrum.

7. The method of any preceding claim, wherein the amino acids are selected from
tryptophan, arginine, histidine, glutamic acid, glutamine, aspartic acid, leucine, threonine,
proline, alanine, tyrosine, carbamido cysteine, phenylalanine, methionine, lysine,
asparagine, isoleucine, cysteine, valine, serine, and glycine.

8. The method of any of claims 1 to 6, wherein the amino acids comprise non-natural
amino acids or chemically modified forms of the naturally occurring amino acids.



WO 98/53323 



PCT/GB98/01486 




9. The method of any preceding claim, wherein the unknown peptide has a
molecular mass greater than about 1,400 Daltons.

10. The method of any preceding claim, wherein the molecular mass for the unknown
peptide is determined using a mass spectrometer.

11. The method of claim 10, wherein the mass spectrometer is a time-of-flight mass
spectrometer.

12. The method of claim 10, wherein the molecular mass and the fragmentation
spectrum for the unknown peptide are determined using a tandem mass spectrometer.

13. A method according to any preceding claim, which additionally comprises the
identification of one or more amino acids in the unknown peptide from its experimental
fragmentation spectrum, or from its method of preparation, and using the one or more
identified amino acids to constrain the library of all possible linear sequences.

14. The method of claim 12, wherein the spectrum has an immonium ion region, and
the immonium region is used to identify one or more amino acids contained in the
unknown peptide.

15. The method of claim 13 or claim 14, wherein the identification comprises
comparing a known characteristic of amino acids with characteristics of the experimental
fragmentation spectrum.

16. The method of any of claims 13 to 15, wherein said one or more amino acids is
or includes the N-terminal or C-terminal amino acid.

17. A method of generating a library of amino acid sequences, wherein each sequence
in the library represents a peptide having a molecular mass that corresponds to a single,
predetermined molecular mass, which comprises

defining a set of combinations of allowed amino acids having a molecular weight
that corresponds to the predetermined molecular mass; and

generating a library of all possible linear sequences of the amino acids in each
combination of the set;

wherein the library is constrained by identification as defined in any of claims 13 to 16.

18. A method according to claim 17, additionally comprising the characteristic of any
of claims 2 to 12.